Add support for temporary output paths #3818

Closed · wants to merge 17 commits

Conversation

@bentsherman (Member) commented Mar 31, 2023

Closes #452

Adds a temporary option to path outputs. See the docs and e2e test for details.

Notes:

  • It is currently a coarse-grained approach: a temp file is deleted only when every process that consumes output from the file's originating process has finished. As a result, temp files may not be deleted as soon as possible. I will investigate more fine-grained approaches, such as tracking consumers at the level of channels or tasks, but they will be more complex. (Update: temp file lifetimes are now determined by downstream tasks.)
  • There is no additional validation on resumed runs. I empty the file contents and preserve the metadata, so temp files can be cached, but I do not try to verify that all downstream tasks are also cached. Again, I will investigate it, but IMO this is primarily valuable during development and not so much in production.
  • Directories and remote paths (e.g. S3) aren't supported yet. I'm working on it! (Update: all paths are now supported.)

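For reference, here is a minimal sketch of how an output might be declared with this option, based on the temporary: true syntax used in the e2e test (the process and file names here are illustrative; see the PR docs for the exact semantics):

process foo {
    output:
    path 'a.txt', temporary: true

    script:
    """
    echo 'foo was here' > a.txt
    """
}
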
Here is how you can test this feature using the e2e test:

$ rm -rf work ; ./launch.sh tests/temporary-outputs.nf 
N E X T F L O W  ~  version 23.03.0-edge
Launching `tests/temporary-outputs.nf` [grave_miescher] DSL2 - revision: de1ab365d0
executor >  local (9)
[ac/b7c06a] process > foo (1) [100%] 3 of 3 ✔
[4a/3bcb01] process > bar (3) [100%] 3 of 3 ✔
[77/fbd3d2] process > baz (3) [100%] 3 of 3 ✔

$ for file in `find work -type f -not -type l -name '*.txt'` ; do echo $file ; cat $file ; done
work/d6/14e2824f038602ebad8af4deb1fc16/c.txt
foo was here
bar was here
baz was here
work/d0/7441dc2e8fde0d279b48bd144f44f0/a.txt
work/77/fbd3d28c610f04701ed9a036b408c4/c.txt
foo was here
bar was here
baz was here
work/4a/3bcb018caccd9c2ca49ef67d362722/b.txt
work/3a/c0a004a645c35c7ed4baedc54e8470/b.txt
work/ac/b7c06ab493f58ef1556cc8081ecb26/a.txt
work/94/f31f20d98868f9d056dc1bd356be9a/c.txt
foo was here
bar was here
baz was here
work/65/b602b5f2bdfef95a61a163d33aa924/a.txt
work/22/aad7d4b922acadc15598484f0ffa63/b.txt

$ ./launch.sh tests/temporary-outputs.nf -resume
N E X T F L O W  ~  version 23.03.0-edge
Launching `tests/temporary-outputs.nf` [cranky_boltzmann] DSL2 - revision: de1ab365d0
[ac/b7c06a] process > foo (1) [100%] 3 of 3, cached: 3 ✔
[4a/3bcb01] process > bar (2) [100%] 3 of 3, cached: 3 ✔
[77/fbd3d2] process > baz (3) [100%] 3 of 3, cached: 3 ✔

After the first run, we inspect the work directory and find that all of the output files declared with temporary: true in the pipeline are now empty. On a resumed run, everything is cached. In this case, a run can be safely resumed as long as all of the baz tasks are cached. If you modify the baz process or delete/modify any of the c.txt files, resuming the run will produce incorrect output.

@bentsherman (Member, Author)

Some additional thoughts:

  • I considered improving the cleaner by tracking channel consumers rather than process consumers, but now I'm not sure that would be safe.

    Consider a process that declares two output channels A and B, where A is temporary and both capture the same files (see the sketch after this list). B's files will be deleted even though B isn't declared temporary -- that is a caveat unto itself. If I track consumers at the channel level rather than the process level, I only have to wait for A's consumers to finish. But really I need to wait for B's consumers too, because they're using the same files.

  • While directories aren't supported, you can probably work around it by emitting the explicit list of files in the directory from your pipeline script. I should be able to support directories directly by walking the directory tree.

  • Remote paths aren't supported ATM because I haven't figured out how to "empty" a file through the Path API. I can delete the contents and reset the modified timestamp, but I haven't found a way to reset the file size. If someone figures out how to do it, please let me know!

    In the meantime, you can probably make it work with remote paths by using Fusion 😄
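
To make the first caveat above concrete, here is a hypothetical process in which two output channels capture the same file; deleting data.txt once A's consumers finish would break B's consumers:

process foo {
    output:
    path 'data.txt', temporary: true, emit: A
    path 'data.txt', emit: B

    script:
    """
    echo 'hello' > data.txt
    """
}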

@ewels (Member) commented Apr 1, 2023

This is awesome! 👏🏻 😎

Is there any need to mark outputs as temporary? If we're using publishDir, is it not true that all work dir files are temporary? I had envisaged this working a bit like cleanup = true, but cleaning as you go along. Doing it without this declaration would be preferable, as then people could opt in without needing to make any pipeline edits. Reading the upstream issues, I guess this option is modelled on Snakemake -- but Snakemake doesn't have separate work and publish directories, so it needs the marker to know which files can be discarded. I don't think that we do.
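
For context, the existing cleanup option referenced here is enabled in the config file, and today it only deletes work files after the entire run has completed successfully:

// nextflow.config
cleanup = true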

Following the cleanup = true analogy: how realistic is it to keep resume functionality working? For example, if Process A's files are deleted, then Process B is edited and the pipeline is rerun, won't it fail in weird and unhandled ways? [edit: just saw that you mentioned this in the docs, so yes] I just wonder if this complicates the matter a lot and whether we could get away with simply removing the whole work directory with less hassle 😬

@mbosio85 commented Apr 3, 2023

@bentsherman I have a couple of questions: how does this act on files that are generated by a process but are not declared as output:? Are they affected by this PR, or do they stay in the work dir?

My understanding is that all those intermediate files which are not used downstream can be safely removed upon process completion, without affecting the resume functionality.

@bentsherman (Member, Author)

@ewels That's a fair point; it would be nice to simply say cleanup = true and let Nextflow figure out which files are temporary. Indeed, as long as outputs aren't published via symlink, everything in the work directory can be deleted.

Hmm, it seems that although task outputs are published before the "on task complete" / "on process terminate" events are sent (which the temp file cleaner uses to trigger cleanup), publishing is asynchronous, so we would also need an event for when publishing is complete for a given task / process.
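
To illustrate the gap, here is a small hypothetical sketch (all names invented, not Nextflow's actual API) of gating deletion on both events for a task:

// Deletion must wait for the task-complete event *and* a
// publish-complete event that does not exist yet.
class TaskCleanupState {
    boolean completed = false   // "on task complete" has fired
    boolean published = false   // would require a new publish-complete event
    boolean deletable() { completed && published }
}

def state = new TaskCleanupState()
state.completed = true
assert !state.deletable()   // publishing may still be in flight
state.published = true
assert state.deletable()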

In any case, I think I will get rid of the "empty file" trick and just delete the file, much easier to support directories and object storage that way. I don't think it's strictly needed for resumability. But I would like to know how important resumability is to the community. Paolo says he really wants it, but I suspect that in production, where the automatic cleanup is most useful, being able to resume is not as important because you aren't fixing bugs, etc.

If the task cache has all the necessary information, and if we can distinguish between a task that was deleted vs a task that was modified, then the resume should be able to skip tasks that were deleted (but not otherwise modified) as long as downstream tasks are also cached. But I think we could go ahead and ship a basic automatic cleanup feature, then try to add resumability in a separate PR.

@ewels (Member) commented Apr 3, 2023

Yup, agree on all points 👍🏻

@bentsherman (Member, Author)

@mbosio85 If an output file isn't captured by an output: declaration, it can safely be deleted when the task completes. This happens by default when using process.scratch = true, because the file is never copied back to the shared work directory. If someone isn't using scratch, they can explicitly delete the file in the process script to free up space in their work directory.
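
For example, a hypothetical process that removes a large uncaptured intermediate at the end of its script:

process bar {
    input:
    path x

    output:
    path 'result.txt'

    script:
    """
    sort $x > intermediate.txt
    gzip -c intermediate.txt > result.txt
    rm intermediate.txt   # not captured by output: -- safe to delete to free space
    """
}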

@bentsherman (Member, Author)

I figured out how to track the downstream tasks of a temporary output -- when a process "closes" (i.e. all of its tasks have been created), we can inspect which tasks use a temporary file and be certain that we have found all of them. So now each temp file can be deleted much sooner.

On top of that, we can save the list of downstream tasks for each task and use it during the resume. To do this, we have to compute the task "inputs" hash and the task "outputs" hash separately. The inputs/script/config of a task must be cached no matter what, but if any outputs are missing, we can traverse the list of downstream tasks and check whether they are cached. We can traverse the entire task dependency graph, as long as we end up at leaf nodes that are cached.

Basically, we want the .nextflow cache to store all the key information encoded in the work directory, so that you could delete entire task directories and still recover them from the .nextflow cache and the downstream tasks.
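
Here is a minimal, self-contained sketch of the resume rule just described (my illustration, not the actual implementation): a task whose outputs were deleted can still be treated as cached if its inputs hash matches and every downstream task is itself resumable.

class Task {
    boolean inputsCached      // inputs/script/config hash matches the cache
    boolean outputsPresent    // output files still exist in the work dir
    List<Task> downstream = []
}

boolean resumable(Task t) {
    if( !t.inputsCached ) return false    // modified task: must re-run
    if( t.outputsPresent ) return true    // intact task: plain cache hit
    // outputs deleted: ok only if every consumer is also resumable
    return t.downstream.every { resumable(it) }
}

// Example: a task with deleted outputs followed by a cached leaf is resumable
def leaf    = new Task(inputsCached: true, outputsPresent: true)
def deleted = new Task(inputsCached: true, outputsPresent: false, downstream: [leaf])
assert resumable(deleted)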

These ideas should all apply to the global cleanup option as well, including resumability. But with the global cleanup, we also have to wait for files to be published. So I'm going to explore resumability in this PR for now, and then translate it to the "eager" cleanup PR. The end goal is global cleanup with resumability; then I think we'll be good to go.

@bentsherman changed the base branch from ben-task-graph to master on April 28, 2023 19:16
@fgualdr commented May 10, 2023

This feature will be awesome!
We are dealing with this issue in production, where space fills up with intermediate files.
When will it be released? We are all thrilled ... and desperate ... :-)

@bentsherman (Member, Author)

The "eager" cleanup PR now has the same capabilities as this one. In particular, it can eagerly delete individual output files in addition to task directories. This piece was important because output files can often times be deleted sooner than task directories. I was going to implement resumability here first and then port it to the other PR, but now we can cut out the middle man 😄

On to resumability...

Closing in favor of #3849.
